feat(page-cluster): resolve which blocking key to use per page #903

Merged
YusukeHirao merged 1 commit into dev from feat/page-cluster-resolve-blocking-group-keys on Jul 3, 2026

@YusukeHirao (Member)

Summary

  • Adds resolveBlockingGroupKeys, which combines the independent URL-path (derivePathGroupKey) and stylesheet (deriveStylesheetGroupKey) blocking keys added in #902 (feat(page-cluster): add URL-path and stylesheet blocking-key derivation) into one final grouping key per page.
  • Priority-with-fallback: prefer the stylesheet signal when it's shared by at least minCssGroupSize pages (default 2), otherwise fall back to the URL-path key.
  • Reuses computeDocumentFrequency/splitTokensByFrequency, added in #901 (feat(page-cluster): add frequency-based template/content token split), to strip stylesheets common across most of the batch (e.g. a shared reset/font file) before hashing, so two unrelated pages that only share such a file aren't wrongly merged into one group.
  • Design was checked against entity-resolution blocking literature (DNF blocking, canopy clustering, ensemble blocking) before implementation; see JSDoc for the reasoning (in particular why OR-merging the two signals would be mathematically unable to reproduce the desired splitting behavior).
  • Validated against two real crawl archives (a homogeneous single-template site and a large multi-section site) in addition to unit tests.
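
One plausible reading of the OR-merging remark, sketched here and not taken from the PR's JSDoc: joining pages that share either key can only coarsen the grouping, so it can never split a URL-path group that the stylesheet signal distinguishes. The `KeyedPage` shape, `orMergeGroups`, `countDistinct`, and the sample data are all hypothetical names made up for illustration.

```typescript
// Hypothetical page record: both keys are assumed to be precomputed strings.
interface KeyedPage {
  url: string;
  pathKey: string;
  cssKey: string;
}

// OR-merge: put two pages in the same group when they share EITHER key.
// Implemented with a small union-find; returns one root index per page.
function orMergeGroups(pages: KeyedPage[]): number[] {
  const parent: number[] = pages.map((_, i) => i);
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving
      i = parent[i];
    }
    return i;
  };
  const firstSeen = new Map<string, number>();
  pages.forEach((page, i) => {
    for (const key of ['path:' + page.pathKey, 'css:' + page.cssKey]) {
      const earlier = firstSeen.get(key);
      if (earlier === undefined) {
        firstSeen.set(key, i);
      } else {
        parent[find(i)] = find(earlier); // union with the first page that had this key
      }
    }
  });
  return pages.map((_, i) => find(i));
}

// Number of distinct group labels.
function countDistinct(labels: Array<string | number>): number {
  return new Set(labels.map(String)).size;
}

// Four pages under one URL section that use two different stylesheet sets.
const demoPages: KeyedPage[] = [
  { url: '/blog/post-a', pathKey: 'blog', cssKey: 'article' },
  { url: '/blog/post-b', pathKey: 'blog', cssKey: 'article' },
  { url: '/blog/archive', pathKey: 'blog', cssKey: 'listing' },
  { url: '/blog/tags', pathKey: 'blog', cssKey: 'listing' },
];

console.log(countDistinct(orMergeGroups(demoPages)));         // 1
console.log(countDistinct(demoPages.map((p) => p.cssKey)));   // 2
```

Because every page shares the `blog` path key, the OR-merge collapses all four pages into one group, while the stylesheet key alone keeps the article and listing templates apart. A priority rule can express that split; a union of the two signals cannot.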

Test plan

  • yarn build
  • yarn lint
  • yarn test (1180 tests passing)
  • /code-review xhigh (6 findings, all fixed)
  • /qa-engineer review
  • /product-manager review
  • /doc review

Combine the independent URL-path and stylesheet blocking keys into a
single grouping key per page, preferring the stylesheet signal when it
is backed by enough shared pages and falling back to the URL path
otherwise.
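
A minimal sketch of the priority-with-fallback rule described above, assuming both keys are already computed per page. The real `resolveBlockingGroupKeys` signature is not shown in this description, so `PageKeys`, `resolveKeysSketch`, the `null` "no signal" value, and the `css:`/`path:` prefixes are illustrative assumptions only.

```typescript
// Hypothetical input shape; not the actual API of this PR.
interface PageKeys {
  url: string;
  pathKey: string;       // stand-in for the URL-path blocking key
  cssKey: string | null; // stand-in for the stylesheet blocking key; null = no signal (assumed)
}

function resolveKeysSketch(
  pages: PageKeys[],
  minCssGroupSize = 2,
): Map<string, string> {
  // Pass 1: count how many pages share each stylesheet key.
  const cssCounts = new Map<string, number>();
  for (const page of pages) {
    if (page.cssKey !== null) {
      cssCounts.set(page.cssKey, (cssCounts.get(page.cssKey) ?? 0) + 1);
    }
  }

  // Pass 2: prefer the stylesheet key when enough pages back it,
  // otherwise fall back to the URL-path key.
  const resolved = new Map<string, string>();
  for (const page of pages) {
    const cssKey = page.cssKey;
    const backed =
      cssKey !== null && (cssCounts.get(cssKey) ?? 0) >= minCssGroupSize;
    // Prefixes keep the two key namespaces from colliding (an assumption of this sketch).
    resolved.set(page.url, backed ? `css:${cssKey}` : `path:${page.pathKey}`);
  }
  return resolved;
}

const samplePages: PageKeys[] = [
  { url: '/products/1', pathKey: 'products', cssKey: 'c1' },
  { url: '/products/2', pathKey: 'products', cssKey: 'c1' },
  { url: '/about', pathKey: 'about', cssKey: 'c2' },       // stylesheet key used by one page only
  { url: '/contact', pathKey: 'contact', cssKey: null },   // no stylesheet signal
];

const resolvedKeys = resolveKeysSketch(samplePages);
console.log(resolvedKeys.get('/products/1')); // css:c1
console.log(resolvedKeys.get('/about'));      // path:about
console.log(resolvedKeys.get('/contact'));    // path:contact
```

The two-pass shape matters here: whether a stylesheet key is "backed by enough shared pages" is a property of the whole batch, so the counts have to exist before any single page can be resolved.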

Reuses computeDocumentFrequency/splitTokensByFrequency to filter out
stylesheets common across most of the batch (e.g. a shared reset/font
file) before hashing, so two unrelated pages that only share such a
file are not wrongly merged.
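
The filtering step can be sketched as follows. This does not reproduce computeDocumentFrequency or splitTokensByFrequency, whose signatures are not shown here; `documentFrequency`, `stylesheetKeySketch`, the 0.8 ratio cutoff, the joined-string key (in place of a hash), and the `null` result for an empty set are all assumptions made for illustration.

```typescript
// Count, for each stylesheet URL, how many pages in the batch reference it.
function documentFrequency(batch: string[][]): Map<string, number> {
  const df = new Map<string, number>();
  for (const sheets of batch) {
    // Count each stylesheet at most once per page.
    const unique = sheets.filter((s, i) => sheets.indexOf(s) === i);
    for (const s of unique) {
      df.set(s, (df.get(s) ?? 0) + 1);
    }
  }
  return df;
}

// Build a key from the stylesheets that are NOT common across most of the batch.
// Returns null when nothing distinctive remains (assumed to trigger the URL-path fallback).
function stylesheetKeySketch(
  sheets: string[],
  df: Map<string, number>,
  batchSize: number,
  maxRatio = 0.8, // hypothetical cutoff for "common across most of the batch"
): string | null {
  const distinctive = sheets
    .filter((s, i) => sheets.indexOf(s) === i)
    .filter((s) => (df.get(s) ?? 0) / batchSize <= maxRatio)
    .sort();
  // The real implementation hashes; a joined string is enough for a sketch.
  return distinctive.length > 0 ? distinctive.join('|') : null;
}

const batch: string[][] = [
  ['reset.css', 'blog.css'],
  ['reset.css', 'blog.css'],
  ['reset.css', 'shop.css'],
  ['reset.css'], // unrelated page that only loads the shared reset file
  ['reset.css'], // another unrelated page with the same single file
];
const frequencies = documentFrequency(batch);

console.log(stylesheetKeySketch(batch[0], frequencies, batch.length)); // blog.css
console.log(stylesheetKeySketch(batch[3], frequencies, batch.length)); // null
```

Without the filter, the last two pages would both key on `reset.css` alone and land in the same group. With it, neither has a distinctive stylesheet left, so in this sketch both yield `null` and would be grouped by URL path instead.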
@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner July 3, 2026 13:34
@YusukeHirao YusukeHirao merged commit 382acce into dev Jul 3, 2026
6 checks passed
@YusukeHirao YusukeHirao deleted the feat/page-cluster-resolve-blocking-group-keys branch July 3, 2026 13:49